
Migrate user_metadata_test.go to testcore.NewEnv #9672

Draft

chaptersix wants to merge 38 commits into temporalio:main from chaptersix:user-metadata-test-env


Conversation

@chaptersix
Contributor

Summary

  • Migrate tests/user_metadata_test.go from testify/suite-based FunctionalTestBase pattern to testcore.NewEnv() subtests
  • Replace testvars.New() with s.Tv() and testcore.NewContext() with s.Context() in subtests
  • Follows the same approach as Migrate activity update tests to use TestEnv #9555

Test plan

  • make lint passes
  • go test -tags functional -run TestUserMetadata ./tests/ -count 1 -v -- all 3 subtests pass

…stcore.NewEnv

Replace suite-based test pattern with t.Run subtests using
testcore.NewEnv(t) for cluster lifecycle management. Use s.Tv()
for test variables and s.Context() for timeout contexts.
@chaptersix chaptersix marked this pull request as ready for review March 26, 2026 00:29
@chaptersix chaptersix requested review from a team as code owners March 26, 2026 00:29
@chaptersix chaptersix marked this pull request as draft March 30, 2026 22:05
ShahabT and others added 22 commits April 2, 2026 07:16
## What changed?
- Clean up unused WV3 test code paths
- Enable the use-revision-numbers config by default

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
None
## What changed?
1. tests/task_queue_stats_test.go — Remove the `t *testing.T` field from
`taskQueueStatsSuite` and replace all `s.t` usages with `s.T()`.
Critically, fix `validateDescribeWorkerDeploymentVersion` to use
`a.NoError(err)` instead of `require.NoError(s.T(), err)` inside the
callback.
2. tests/xdc/history_replication_signals_and_updates_test.go — Fix
`pollWorkflowResult` to handle errors from `GetWorkflowExecutionHistory`
gracefully: return the error from the inner function, retry with a nil
page token on transient errors (e.g. CurrentBranchChanged after conflict
resolution), and abort cleanly when the context expires.

## Why?
TestTaskQueueStats panic: Calling require.NoError(s.T(), err) inside an
EventuallyWithT callback is unsafe. EventuallyWithT runs its callback in
a separate goroutine on a ticker. If the outer test context expires and
the test completes while this callback is still running, the
require.NoError call invokes t.Fail() on a completed test, which panics
the entire test binary with panic: Fail in goroutine after TestXxx has
completed. The fix uses the CollectT-scoped a.NoError(err) so assertions
are buffered and safe.

TestConflictResolutionGetResult hang/crash: pollWorkflowResult was
calling responseInner.History.Events without checking if responseInner
was nil. When GetWorkflowExecutionHistory returns an error (e.g.
CurrentBranchChanged after conflict resolution resets the branch token),
the response is nil, causing a nil pointer dereference that panics the
goroutine. Because the goroutine crashed before sending to
workflowResultCh, the main test goroutine blocked on <-workflowResultCh
for the entire 30-minute CI timeout. The fix returns the error from the
inner function and retries with a nil page token on transient errors,
which allows the poll to succeed on the updated branch.

## How did you test it?
- [ ] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
The XDC fix changes `pollWorkflowResult` to retry indefinitely on any
non-context error. If a non-transient error were to occur repeatedly,
this could loop until the context expires rather than failing
immediately. This is acceptable since the context timeout provides a
hard bound, and the test failure message via `c.t.s.NoError(ctx.Err(),
...)` will make the root cause clear.
…poralio#9675)

## Summary
- Replace suite-based `AdminTestSuite` with `t.Run` subtests using
`testcore.NewEnv()` for better parallelization and simpler test
infrastructure
- Convert `rebuildMutableStateWorkflowHelper` to accept
`*testcore.TestEnv` instead of `*AdminTestSuite`
- Pass `EnableChasm` via `testcore.WithDynamicConfig` instead of
`OverrideDynamicConfig`

## Test plan
- [x] `go build` compiles cleanly
- [x] `make lint` passes
- [x] `go test -tags disable_grpc_modules,test_dep -run
TestAdminTestSuite ./tests/...` passes
…9676)

## Summary
- Migrate `activity_api_batch_reset_test.go` from suite-based pattern
(`FunctionalTestBase` + `suite.Run`) to `testcore.NewEnv` with `t.Run`
subtests
- Convert `createWorkflow` receiver method to standalone
`createBatchResetWorkflow` function
- Replace `s.TaskQueue()` with `s.WorkerTaskQueue()`

## Test plan
- [x] `go build -tags disable_grpc_modules,test_dep ./tests/...`
- [x] `go test -tags disable_grpc_modules,test_dep -timeout 5m -run
TestActivityApiBatchResetClientTestSuite ./tests/...`
- [x] `make lint`
…emporalio#9685)

## Summary

- Route `ErrClosed` from CHASM scheduler through to V1 fallback, so
operations on deleted CHASM schedules fall through to the
workflow-backed scheduler stack
- Add explicit `ErrClosed` checks in `Scheduler.Update` and
`Scheduler.Patch` since `UpdateComponent` does not reject mutations on
completed executions
- Add sentinel workflow detection in `describeScheduleWorkflow` to
return NotFound instead of QueryFailed when the V1 fallback hits a
sentinel dummy workflow
- Add functional tests for operations on deleted schedules (describe,
update, patch, delete) across both CHASM and V1 paths
- Add migration test verifying V1-to-V2 migration succeeds when a CHASM
schedule with the same ID was previously deleted (closed executions
allow business ID reuse)

## Test plan

- [x] `TestScheduleUpdateAfterDelete` -- update, patch, delete on closed
CHASM schedule via scheduler client
- [x] `testDeletedScheduleOperations` in shared schedule suite --
describe returns NotFound for both CHASM and V1
- [x] `TestScheduleMigrationV1ToV2WithClosedV2` -- V1-to-V2 migration
with previously deleted V2 schedule
- [x] Existing schedule migration tests still pass
…ralio#9382)

## What changed?
- Ensure the TQs receive and apply the right version data after revive.
- Made delete propagation always happen serially to other propagations.
This ensures all other propagations are cancelled before the delete
propagation starts.
- Deprecate the `deleted` flag in version data and the GC logic around
it. Now we use the good old forgetVersion path which immediately removes
the version data from TQ.
- Ensure version state is reset after revive, in case the recreation
happened before workflow close.
- Also, workflows now continue-as-new (CaN) based on the SDK's
suggestion if no pending Signal or Update is present.

## Why?
The version could get stuck in the deleted state from the TQ's point of
view if revived before the (now deprecated) GC logic cleaned it up.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [x] added new functional test(s)

## Potential risks
None

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…alio#9696)

## What changed?
Added support for `IS NULL` and `IS NOT NULL` operators in the
ListWorkers query engine.

- Implemented `evaluateIsExpr` in `worker_query_engine.go` to handle `IS
NULL` / `IS NOT NULL` for string and time fields
- For string fields: `IS NULL` matches empty values, `IS NOT NULL`
matches non-empty values
- For time fields (`StartTime`, `HeartbeatTime`): `IS NULL` matches when
the timestamp is not set
- Unsupported IS operators (`IS TRUE`, etc.) return a clear error
- Updated doc comment with the new supported operators

## Why?
To enable filtering workers by whether a field is set or not (e.g.,
`DeploymentName IS NULL` to find workers without a deployment).

## How did you test it?
- [x] built
- [x] added new unit test(s)

New tests:
- String field IS NULL / IS NOT NULL (DeploymentName with empty and
populated values)
- Time field IS NULL / IS NOT NULL (StartTime, HeartbeatTime with nil
and set timestamps)
- Composite queries combining IS NULL with AND/OR
- Error cases: unknown column rejected, unsupported IS operators (e.g.,
IS TRUE) rejected
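The NULL semantics described above can be sketched as a small evaluator. This mirrors, but is not, the actual `evaluateIsExpr` code; the type switch is an assumption about how field values are represented:

```go
package main

import (
	"fmt"
	"time"
)

// isNull reports whether a field value counts as NULL under the
// semantics described above: empty string fields and unset (nil)
// timestamps are NULL; anything else is an unsupported IS operand.
func isNull(v any) (bool, error) {
	switch x := v.(type) {
	case string:
		return x == "", nil // empty string field matches IS NULL
	case *time.Time:
		return x == nil, nil // unset timestamp matches IS NULL
	default:
		return false, fmt.Errorf("IS NULL not supported for %T", v)
	}
}

func main() {
	now := time.Now()
	for _, v := range []any{"", "myDeploy", (*time.Time)(nil), &now} {
		ok, _ := isNull(v)
		fmt.Println(ok)
	}
}
```

`IS NOT NULL` is simply the negation, and unsupported operands (the `IS TRUE` case) surface the error instead of silently evaluating to false.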
…o#9705)

## Summary

- Adds two new metric labels (`temporal_worker_deployment_name` and
`temporal_worker_deployment_build_id`) to `approximate_backlog_count`
and related backlog metrics
- These decompose the existing `worker_version` label (which encodes
`deploymentName:buildId` as a single string) into two separate fields
- Enables downstream consumers like Kubernetes HPA via
prometheus-adapter to match on deployment name and build ID
individually, without hitting Kubernetes label length/format constraints

## Cardinality impact

**None.** The new labels are a 1:1 decomposition of the existing
`worker_version` value — every unique `(deployment_name, build_id)`
tuple maps to exactly one `worker_version` string. No new time series
are created.

The same `BreakdownMetricsByBuildID` dynamic config gates all three
labels:

| Queue type | `worker_version` | `temporal_worker_deployment_name` | `temporal_worker_deployment_build_id` |
|---|---|---|---|
| Unversioned | `__unversioned__` | "" | "" |
| V3 versioned, gate **off** | `__versioned__` | "" | "" |
| V3 versioned, gate **on** | `myDeploy:build1` | `myDeploy` | `build1` |
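The decomposition in the table can be sketched as a tiny helper. This is an illustration of what a helper like `parseDeploymentFromVersionKey` might do, not the actual matching-service code:

```go
package main

import (
	"fmt"
	"strings"
)

// splitVersionKey decomposes a worker_version value of the form
// "deploymentName:buildId" into its two parts. Sentinel values such as
// __unversioned__ (and anything without a separator) yield empty tags,
// matching the table above.
func splitVersionKey(key string) (deployment, buildID string) {
	if strings.HasPrefix(key, "__") { // sentinels: __unversioned__, __versioned__
		return "", ""
	}
	if i := strings.LastIndex(key, ":"); i >= 0 {
		return key[:i], key[i+1:]
	}
	return "", ""
}

func main() {
	fmt.Println(splitVersionKey("myDeploy:build1"))
	fmt.Println(splitVersionKey("__unversioned__"))
}
```

Since every `(deployment, buildID)` pair maps back to exactly one version key, the split adds no new time series.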

## Files changed

- `common/metrics/tags.go` — New tag constants and constructor functions
- `service/matching/physical_task_queue_manager.go` — Tags added to
handler (propagates to db.go emission sites)
- `service/matching/task_queue_partition_manager.go` — Tags added to
logical backlog emission sites + `parseDeploymentFromVersionKey` helper

## Test plan

- [ ] Existing matching service tests pass (CI)
- [ ] Verify labels appear in metrics output after deployment

🤖 Generated with [Claude Code](https://claude.com/claude-code)


---

> [!NOTE]
> **Low Risk**
> Low risk: changes are limited to metrics tagging and are gated by the existing `BreakdownMetricsByBuildID` flag, but could affect downstream metric queries/dashboards expecting the prior label set.
>
> **Overview**
> Adds two new metrics tags, `temporal_worker_deployment_name` and `temporal_worker_deployment_build_id`, alongside `worker_version`.
>
> Propagates these tags through matching task queue manager/backlog metric emission (including logical backlog and unload zeroing), extracting the deployment/build ID from the version key when applicable and updating the affected matching metrics test expectations.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Carly de Frondeville <cdefrondeville@berkeley.edu>
## What changed?
Wait for tasks to actually load in fairness migration test.

## Why?
The waitForTasks tests only waited for metadata initialization, not for
tasks to actually load. We have to wait for them to load too, to prevent
a poller from grabbing an active task before draining.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests — ran 200 times in CI
- [ ] added new unit test(s)
- [ ] added new functional test(s)
…lio#9723)

## What changed?
Change `temporal_worker_deployment_name` -> `worker_deployment_name` and
`temporal_worker_deployment_build_id` -> `worker_build_id`.

## Why?
Found out that External Obs can add the prefix later, as it does for
other tags.
Also dropped `_deployment_` from the build id tag because the build id
has a stronger relation to the worker than the worker deployment, and
this is easier to say and type.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Zero risk. We added these labels last night. If we can patch this into
OSS (or just merge to main and restart Long Haul) no one will ever see
the tags we merged last night.

---

> [!NOTE]
> **Low Risk**
> Low risk change limited to renaming two metrics tag keys; the main impact is dashboards/alerts or external tooling expecting the old tag names.
>
> **Overview**
> Renames the metrics tag keys for worker deployment identifiers in `common/metrics/tags.go`, changing `temporal_worker_deployment_name` to `worker_deployment_name` and `temporal_worker_deployment_build_id` to `worker_build_id` for shorter labels.
## What changed?

Use atomic.Value with Swap to please the race detector for the race
between the watchMembership goroutine (writer) and gRPC goroutines
(readers).

## Why?
Data race in integration test

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks (from @paulnpdev's original commit)

Hard to evaluate whether the code utilizing this field correctly handles
the real-world race involving mutation of this field in real time,
especially as I am not very familiar with this code. Counting on my
reviewers to pay close attention here. Here is
Claude's analysis:

Writer (`notifyNexusEndpointsOwnershipChange`, single goroutine via `watchMembership`):

- Loads the current channel (line 2769)
- Closes it
- Stores a new channel (line 2770)

Reader A (`checkNexusEndpointsOwnership`, any gRPC goroutine):

- Loads the channel atomically (line 2751)
- Returns it to the caller

Reader B (`ListNexusEndpoints`, uses the returned channel):

- Calls `checkNexusEndpointsOwnership`, gets `ownershipLostCh` (line 2703)
- Selects on `ownershipLostCh` in the long-poll loop (line 2735)

The critical sequence in the writer (lines 2769-2770) is a
load-close-store — two separate atomic operations, not one. There's a
window between close and store where a concurrent reader could load the
already-closed channel. This is actually fine for the intended
semantics:

If a reader loads the old channel (before or during close): it will see
the close signal and correctly report ownership lost.
If a reader loads the new channel (after store): it will block until the
next ownership change — also correct.
If a reader loads the old channel between close and store: it gets a
closed channel, which immediately unblocks the select — correctly
signaling ownership lost.
One real concern: line 2769 does a second Load() to get the channel to
close. If two membership changes fired in rapid succession (the channel
is buffered with size 1), could notifyNexusEndpointsOwnershipChange race
with itself? No — the comment on line 2761 states this method is only
called from the single watchMembership goroutine (the for range loop on
line 359). So the writer is single-threaded, and the load-close-store
sequence is safe from self-races.

Verdict: The atomic fix is correct. The close-then-store non-atomicity
is benign because any interleaving produces the correct outcome (the
closed channel signals ownership loss regardless of when a reader
observes it).
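The load-close-store pattern under discussion can be reduced to a self-contained sketch (type and method names here are illustrative, not the server's):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// notifier holds a chan struct{} in an atomic.Value. The channel is
// closed to signal readers that ownership changed, and a fresh channel
// is stored for the next cycle.
type notifier struct{ ch atomic.Value }

func newNotifier() *notifier {
	n := &notifier{}
	n.ch.Store(make(chan struct{}))
	return n
}

// watch is the reader side: it atomically loads the current channel.
func (n *notifier) watch() chan struct{} { return n.ch.Load().(chan struct{}) }

// notify is the single-writer side: load, close, then store a new
// channel. Any reader that loaded the old channel observes the close.
func (n *notifier) notify() {
	old := n.ch.Load().(chan struct{})
	close(old)
	n.ch.Store(make(chan struct{}))
}

// isClosed reports whether a select on ch would unblock immediately.
func isClosed(ch chan struct{}) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

func main() {
	n := newNotifier()
	before := n.watch()
	n.notify()
	after := n.watch()
	fmt.Println(isClosed(before), isClosed(after))
}
```

As the analysis concludes, every interleaving is benign: a reader holding the old channel sees the close, and a reader loading the new one blocks until the next change.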

---------

Co-authored-by: Roey Berman <roey.berman@gmail.com>
…emporalio#9708)

Change the default for FrontendNexusRequestHeadersBlacklist from empty
to block internal headers (accept-encoding, x-forwarded-for,
xdc-redirection, xdc-redirection-api, temporal-nexus-failure-support)
that shouldn't propagate to user Nexus handlers.
…oralio#9692)

## What changed?
testrunner writes intermediate JUnit XML reports after each completed
test attempt and enforces its own total deadline derived from the go
test -timeout flag.

## Why?
When a test run exceeds the GitHub Actions job timeout (40 min), the
entire process group is killed before the testrunner can write its final
merged XML, producing a synthetic junit.crash.xml showing
`functional-test (crash)`, with no visibility into actual test failures.

[Sample
run](https://github.com/temporalio/temporal/actions/runs/23567876357)
that actually had 108 failures on manual inspection

There are two distinct failure modes this addresses:

1. Killed between attempts — attempt 1 completes with failures, the
process is killed during attempt 2. Previously: no XML on disk. Now:
attempt 1's results survive in the intermediate file.
2. Go test timeout doesn't fire — the test binary hangs past its own
`-timeout=35m` (e.g. deadlocked goroutines). Previously: GitHub kills at
40 min with no XML. Now: the testrunner enforces the same 35m deadline
itself, kills `gotestsum`, extracts any `--- FAIL:` lines captured from
stdout, writes a report, and exits cleanly within the 5-minute buffer
before GitHub's kill.

## How did you test it?
- [X] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [X] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
The testrunner's context deadline fires at the same time as go test's
own -timeout. If both fire simultaneously, parseTestTimeouts (existing
path) and the new `ctx.Err()` check (new path) can both trigger. The
`ctx.Err()` check is guarded to only activate if the normal
timeout-panic path didn't already break the loop, so there is no
double-processing.
…lio#9377)

## What changed?
Add an archetype metric tag and logical task processing metrics, and add
`chasm_pure_task_requests` and `chasm_pure_task_errors` metrics. The
archetype metric tag is applied by default to all executed tasks,
physical and logical, added in Executable.go. If the archetype is found,
its display name is used as the tag value; otherwise it defaults to the
`workflow` archetype.

For the CHASM logical pure task metrics, the tag resolves to
`__unknown__` if the archetype is not found, since that would be
unexpected behavior.

## Why?
Components and therefore task types can be used across different
archetype executions. In the future with more archetypes, there needs to
be visibility for which archetype created the task for execution.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [X] added new unit test(s)
- [ ] added new functional test(s)
## What changed?

Moved the decision of which d2 layout engine to use into the file itself.

## Why?

Allows each file to pick the right layout engine for itself.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…poralio#9743)

## What changed?
* Upgrade the `actions/create-github-app-token` and `actions/setup-go`
actions because of deprecation notices for NodeJS
* Change the test summary to `!cancelled` so successful runs will still
show the failed tests in Summary.

## Why?
Deprecation warnings for the first bullet point. For the second bullet
point, it's helpful to see the flaky tests on a successful run.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Tests only
## What changed?
This PR implements the server-side plumbing for Principal Attribution -
attaching a server-computed, immutable caller identity (Principal) to
workflow history events.
* Extended the Authorizer Result struct to include a Principal struct
* Added the authorization interceptor logic to propagate Principal via
gRPC metadata headers when enabled via dynamic config
(frontend.enablePrincipalPropagation)
* The default authorizer sets principal based on sub JWT claim
* On the history service side, Principal is read from context headers
and stamped onto every history event
* Principal is set per-operation at two entry points:
UpdateWorkflowWithNew (existing workflows) and NewWorkflowWithSignal
(new workflows)
* Added Principal headers to the Nexus disallowed operation headers
blocklist to prevent leaking internal identity to external endpoints

## Why?
The existing identity field on history events is client-supplied and
spoofable. Principal Attribution provides a trusted, server-computed
identity derived from authentication context (e.g. JWT claims, mTLS
certificates), enabling reliable audit trails for workflow operations.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [X] added new unit test(s)

I ran a simple test with local JWT server (PR here:
temporalio#9647). There a test client
sends a signal to a running workflow using a separate JWT identity, then
waits for the workflow to complete. Here is the workflow history:

| eventId | eventType | principal |
|---------|-----------|-----------|
| 1 | WORKFLOW_EXECUTION_STARTED | jwt/workflow-starter |
| 2 | WORKFLOW_TASK_SCHEDULED | jwt/workflow-starter |
| 3 | WORKFLOW_TASK_STARTED | jwt/workflow-starter |
| 4 | WORKFLOW_TASK_COMPLETED | jwt/workflow-starter |
| 5 | WORKFLOW_EXECUTION_SIGNALED | jwt/signal-sender |
| 6 | WORKFLOW_TASK_SCHEDULED | jwt/signal-sender |
| 7 | WORKFLOW_TASK_STARTED | jwt/workflow-starter |
| 8 | WORKFLOW_TASK_COMPLETED | jwt/workflow-starter |
| 9 | ACTIVITY_TASK_SCHEDULED | jwt/workflow-starter |
| 10 | ACTIVITY_TASK_STARTED | jwt/workflow-starter |
| 11 | ACTIVITY_TASK_COMPLETED | jwt/workflow-starter |
| 12 | WORKFLOW_TASK_SCHEDULED | jwt/workflow-starter |
| 13 | WORKFLOW_TASK_STARTED | jwt/workflow-starter |
| 14 | WORKFLOW_TASK_COMPLETED | jwt/workflow-starter |
| 15 | WORKFLOW_EXECUTION_COMPLETED | jwt/workflow-starter |

## Potential risks
- Feature-gated: Principal propagation is behind
frontend.enablePrincipalPropagation dynamic config (default off), so no
impact unless explicitly enabled.
- Nexus blocklist addition: Adding principal-type/principal-name to the
Nexus disallowed headers blocklist could reject operations that happen
to use these header names
…alio#9493)

## What changed?
In some cases for the gcloud connector the writer could be left open if
the io.Copy call failed. This condition occurs when the gcloud client
returns an error due to 429s or other error responses. In this case the
writer is never closed, when the intention of the code is just to ensure
the most relevant error is surfaced.

This change ensures that in this case the correct error is surfaced and
an attempt is made to close the writer.

## Why?
A flurry of 503s and 429s from google made me look at this code, and I
became aware this may be a memory leak

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
This is a minor change, very low risk on an error path

---------

Co-authored-by: Vladyslav Simonenko <vlad.simonenko@temporal.io>
## What changed?
Exclude polling cases from get apis

## Why?
There are long-polling cases in the Get APIs, and we have trouble
identifying those cases among the regular Get APIs.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## What changed?

Added `parallelsuite` package to provide an alternative to testify's
suite.

**For reviewer: Use "Hide Whitespace" to make the diff smaller.**

## Why?

1. **Parallel by default**: all test methods and subtests run in
parallel automatically
2. **IDE support**: GoLand run gutters, navigation, and test discovery
work on Test* methods
3. **Misuse detection**: guard catches mixing assertions with Run(),
both directions, at runtime
4. **Type-safe subtests**: `s.Run` callbacks receive the concrete suite
type, so suite helpers are available
5. **No shared state**: each test method and subtest gets its own fresh
copy

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)
)

## What changed?
Add new review guidelines for test reliability and fix instances found
in the current codebase.

## Why?
I have found these to be the most common patterns across the last two
weeks of targeting flaky tests.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Minimal
## What changed?

Adds `buf format` as a Makefile target and integrates it into `make fmt`.

All `.proto` changes are from running `make fmt`.

## Why?

Consistent protobuf file style.

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
stephanos and others added 14 commits April 2, 2026 07:16
## What changed?

Deprecate `WithSdkWorker` testEnv option and init it lazily instead.

## Why?

Requires less manual setup.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

Co-authored-by: Alex Stanfield <13949480+chaptersix@users.noreply.github.com>
## What changed?

As titled.

## Why?

Migrating to the latest and greatest, allowing parallel test runs.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

```
longtran@Longs-MacBook-Pro:~/work/temporal (branch: migrate-to-testenv!)
[1] $ go test ./tests/ -run TestLinksTestSuite
ok      go.temporal.io/server/tests     1.148s
```
Automatically generated by the optimize-test-sharding workflow.

Co-authored-by: Temporal Data <commander-data@temporal.io>
## What changed?
WISOTT

## Why?
Safety and runtime increases from parallelizing tests.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
NA
## What changed?

Changing `s`s to `env`s.

i.e., it migrates some of the tests that were already using testEnv but
not yet parallelsuites.

**Tip: Use "Hide Whitespace" for review**

## Why?

Progress is inevitable.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## What changed and why?
### **Security**

- Check namespaces in batch workflow
([CVE-2026-5199](https://www.cve.org/cverecord?id=CVE-2026-5199), LOW)

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [x] added new functional test(s)

---

> [!NOTE]
> **Medium Risk**
> Tightens namespace validation for privileged batcher activities to prevent cross-namespace request spoofing; mistakes here could block legitimate batch jobs or still leave gaps if other call paths use unvalidated namespaces.
>
> **Overview**
> Adds strict namespace isolation checks to the batcher worker: `BatchActivityWithProtobuf` now validates that `NamespaceId` *and* any provided `Request.Namespace`/`AdminRequest.Namespace` match the worker's bound namespace, and consistently uses that bound namespace for downstream frontend calls (including reset-by-type).
>
> Adds targeted tests: new unit tests to reject mismatched namespace strings and to assert `startTaskProcessor` uses the worker namespace for signals, plus a functional test ensuring a batch terminate in one namespace does not affect workflows in another.

---------

Co-authored-by: Sean Kane <sean.kane@temporal.io>
temporalio#9091)

## What changed?
Remove nil search attributes from history events / mutable state / chasm
node

## Why?
Search attributes can be explicitly set as null on the following events:

* Workflow Start

* Upsert Search Attributes

* Continue-as-new

Starting a workflow or continuing a workflow as new with a nil search
attribute records the `{key: nil}` mapping in mutable state and the
history event.

Upserting search attributes erases any current search attribute value
from persistence (mutable state and visibility record).

Upserting the search attribute as nil in the SDK retains the mapping in
the WorkflowInfo as a {key: nil} mapping. On a continue-as-new event,
SDK sends this mapping to Server, and the search attribute gets
repopulated with the null search attribute in the history event and
mutable state.

This last scenario can cause inconsistency between the expected search
attributes and what is recorded in history. After upserting a nil search
attribute, the mutable state and visibility record no longer shows this
search attribute. But, once continuing as new, the search attribute
reappears, leading to unexpected NDE if the search attribute is also
deregistered.
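The fix amounts to dropping explicitly-nil mappings before they are persisted. A minimal sketch, assuming a plain byte-slice map rather than the real indexed field payloads:

```go
package main

import "fmt"

// pruneNilSearchAttributes drops keys whose value payload is nil before
// the attributes are recorded in mutable state or a history event, so a
// {key: nil} mapping never round-trips through continue-as-new.
func pruneNilSearchAttributes(sa map[string][]byte) map[string][]byte {
	out := make(map[string][]byte, len(sa))
	for k, v := range sa {
		if v == nil {
			continue // explicitly-nil attribute: treat as unset
		}
		out[k] = v
	}
	return out
}

func main() {
	sa := map[string][]byte{"CustomKeyword": []byte("x"), "Removed": nil}
	fmt.Println(len(pruneNilSearchAttributes(sa)))
}
```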

## CHASM change
This also removes nil search attributes from CHASM nodes on replacement.
This mirrors the change to Workflow History Events and Mutable State.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [X] added new unit test(s)
- [X] added new functional test(s)
## What changed?
Replace plain `int` with `atomic.Int32` in two VersionWorkflowSuite unit
tests.

## Why?
CI caught this as a crash in the unit test job

Failure
[here](https://github.com/temporalio/temporal/actions/runs/23856744785)
```
WARNING: DATA RACE
Read at 0x00c000b005e8 by goroutine 320:
  go.temporal.io/server/service/worker/workerdeployment.(*VersionWorkflowSuite).Test_MultipleSyncStates_BlocksCaNUntilAllComplete.func2()
      version_workflow_test.go:1240 +0x2c
Previous write at 0x00c000b005e8 by goroutine 326:
  ...version_workflow_test.go:1240 +0x3c
```

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [X] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
NA
## What changed?

1. Updated Go SDK to latest version per
https://pkg.go.dev/go.temporal.io/sdk?tab=versions, latest I see today
is `v1.41.1` from Mar 17 2026
2. Ran `go mod tidy`
3. Some tests were failing due to SDK interface changing, ran `cd
common/testing/mocksdk && go generate` -- generated changes in
`common/testing/mocksdk/client_mock.go`
4. Fixed Nexus test suite due to breaking changes and deprecated APIs in
new SDK -- changes in `tests/nexus_workflow_test.go`

## Why?

To use the latest version of our SDK and keep tests up-to-date.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
Only replicating as part of a namespace creation, not on namespace
update.

## What changed?
Explicitly including ReplicationState in the replication of namespace
creation

## Why?
Currently ReplicationState is not replicated on namespace creation,
leading to cases where the active cluster is state:Normal and the
standby cluster is state:Undefined.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
New section of code to me, hard to assess but this contains minimal
changes.

Testing:
1. make start-dependencies 
2. make install-schema-xdc
3. make start-xdc-cluster-a
4. make start-xdc-cluster-b
5. temporal --address 127.0.0.1:8233 operator cluster upsert
--frontend-address "127.0.0.1:8233" --enable-connection
--enable-replication
6. temporal --address 127.0.0.1:8233 operator cluster upsert
--frontend-address "127.0.0.1:7233" --enable-connection
--enable-replication
7. temporal --address 127.0.0.1:8233 operator namespace list 
8. temporal --address 127.0.0.1:7233 operator namespace create --global
--active-cluster cluster-a --cluster cluster-a --cluster cluster-b -n
demo_namespace
9. temporal --address 127.0.0.1:8233 operator namespace list

Output for 9:
Branch:
% temporal --address 127.0.0.1:8233 operator namespace list
  ReplicationConfig.State               Normal

Main:
% temporal --address 127.0.0.1:8233 operator namespace list
  ReplicationConfig.State               Unspecified

---------

Co-authored-by: stuart-wells <stuart.well@temporal.io>
## What changed?
Instead of dropping expired tasks immediately, pass them through
mergeTasksLocked and treat them as already-acked, so that the read level
and ack level can be updated appropriately.

## Why?
As the previous comment suggests, reading a batch of entirely expired
tasks would lead to incorrect behavior: since the acks were not placed
in outstandingTasks, the read level wouldn't be updated, and we'd
re-read the expired tasks in a loop, i.e. the reader would be stuck.
(Unless a task was written within that range, that would unstick it.)
Even if we read a batch that wasn't all expired, but ended in a long
stretch of expired tasks, we wouldn't update the read level as much as
we could.

This adds expired tasks as acks, and then adds a call to
advanceAckLevelLocked, so the read level can be pushed up, and we can
advance the ack level over a stretch of expired tasks as well.
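The ack-level mechanics described above can be sketched with a simplified manager (names and structure here are illustrative, not the matching-service code): expired tasks enter the outstanding set already acked, so both the read level and the ack level advance past an all-expired batch instead of re-reading it forever.

```go
package main

import (
	"fmt"
	"sort"
)

// ackManager tracks outstanding task IDs; the ack level advances over
// any contiguous prefix of acked IDs.
type ackManager struct {
	readLevel   int64
	ackLevel    int64
	outstanding map[int64]bool // id -> acked
}

// merge adds a batch of read task IDs. Expired tasks are treated as
// already-acked rather than dropped, which is the fix described above.
func (m *ackManager) merge(ids []int64, expired func(int64) bool) {
	for _, id := range ids {
		m.outstanding[id] = expired(id) // expired tasks enter pre-acked
		if id > m.readLevel {
			m.readLevel = id
		}
	}
	m.advanceAckLevel()
}

// advanceAckLevel moves the ack level over the acked prefix.
func (m *ackManager) advanceAckLevel() {
	ids := make([]int64, 0, len(m.outstanding))
	for id := range m.outstanding {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	for _, id := range ids {
		if !m.outstanding[id] {
			break
		}
		m.ackLevel = id
		delete(m.outstanding, id)
	}
}

func main() {
	m := &ackManager{outstanding: map[int64]bool{}}
	// a batch that is entirely expired still moves both levels forward
	m.merge([]int64{1, 2, 3}, func(int64) bool { return true })
	fmt.Println(m.readLevel, m.ackLevel)
}
```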

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s) – and tested it fails without the fix
- [ ] added new functional test(s)
…on (temporalio#9774)

## What changed?
Reorder code so that the non-blocking `ReadComponent` call uses the
original context, and context with the altered deadline is used for the
long-poll only

## Why?
It looks like it was a mistake; the code state after this commit appears
to be what was originally intended.

## How did you test it?
- [x] covered by existing tests

## Potential risks
Could break describe for standalone activity if this change was
misconceived.
@chaptersix chaptersix marked this pull request as ready for review April 2, 2026 12:17
@chaptersix chaptersix requested review from a team as code owners April 2, 2026 12:17
@chaptersix chaptersix marked this pull request as draft April 2, 2026 14:56
